GVPT Maths Camp

Exploratory Data Analysis

Learning objectives

  1. Learn how to answer basic questions about your data

  2. Learn how to identify interesting relationships in your data

  3. Use your new data science tools to better understand your data

  4. Upload your scripts to GitHub

A familiar problem

A solution

  • Track changes to your documents, code, or data over time

  • Work from one document

  • Have access to your work from anywhere

  • Create safe points in case something breaks or you want to experiment

Git and GitHub

Git: open source version control software.

Think R.

GitHub: a website that allows you to store your Git repositories online and makes it easy to collaborate with others.

Think RStudio.

Why should I use Git and GitHub? 🤔

  • More reproducible, transparent research

  • Better version control

  • Easy collaboration with others

The basics

Four verbs you need to know to use Git for version control:

  1. add

  2. commit

  3. push

  4. pull

Using Git in RStudio

Three different options:

  1. RStudio GUI

  2. Shell/terminal

  3. GitHub Desktop

Repositories

  • A repository is like a folder for your project, but better!

  • Organises your work

  • Displays useful information, including a general description, navigation, changes

  • A great tool for project-oriented workflows

Starting a new project: create a repository

Sync your online repository with RStudio: from scratch

Sync your online repository with RStudio: existing R project

  • We already have R projects that we started yesterday.

  • We can sync the existing R project with our new repository.

The usethis R package is a brilliant helper package.

install.packages("usethis")


usethis::create_from_github(
  "https://github.com/YOU/YOUR_REPO.git",
  destdir = "~/path/to/where/you/want/the/local/repo/"
)

RStudio GUI: Workflow

  1. pull any changes stored in your GitHub repository before you start making your own changes

  2. add your changes to the staging area

  3. commit your changes with a meaningful message

  4. push those committed changes up to GitHub

Shell/terminal: Workflow

  1. pull any changes stored in your GitHub repository before you start making your own changes

  2. add your changes to the staging area

  3. commit your changes with a meaningful message

  4. push those committed changes up to GitHub

Working with others

GitHub is like Google Docs for your code.

EXERCISE

  1. Create a new GitHub repository for this camp.

  2. Sync your existing R project to this new repository.

  3. add your scripts from yesterday and today.

  4. Write a helpful commit message for your future self.

  5. push your work up to GitHub.

  6. Add me as a collaborator: @hgoers.

Two basic questions to guide your EDA

Exploratory data analysis is a critical step in your quantitative research process.

  1. What type of variation occurs within my variables?

  2. What type of covariation occurs between my variables?

Examining gapminder

library(tidyverse)
library(gapminder)

head(gapminder)
# A tibble: 6 × 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.
6 Afghanistan Asia       1977    38.4 14880372      786.

Variation

What is the earliest and latest year we cover?

summarise(gapminder, min(year), max(year))
# A tibble: 1 × 2
  `min(year)` `max(year)`
        <int>       <int>
1        1952        2007

What about our other numeric variables?

# reframe() (dplyr >= 1.1.0) is intended for summaries that return more than one row
reframe(gapminder, across(lifeExp:gdpPercap, ~ quantile(.x)))
# A tibble: 5 × 3
  lifeExp         pop gdpPercap
    <dbl>       <dbl>     <dbl>
1    23.6      60011       241.
2    48.2    2793664      1202.
3    60.7    7023596.     3532.
4    70.8   19585222.     9325.
5    82.6 1318683096    113523.

The Five Number Summary

The five number summary is a useful way to summarise numeric data. It consists of the:

  • Minimum,

  • 25th percentile,

  • 50th percentile (median),

  • 75th percentile,

  • Maximum
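
As a minimal sketch, the five numbers can be computed in base R with `quantile()` or `fivenum()`. The vector `x` below is made-up illustrative data, not drawn from gapminder. It is chosen so that one extreme value pulls the mean well away from the median (the 50th percentile).

```r
# Made-up illustrative values (an assumption, not gapminder data)
x <- c(2, 4, 6, 8, 100)

# By default, quantile() returns the minimum, the 25th, 50th and 75th
# percentiles, and the maximum
five_num <- quantile(x)
five_num

# fivenum() computes Tukey's five number summary:
# minimum, lower hinge, median, upper hinge, maximum
tukey <- fivenum(x)
tukey

# The 50th percentile is the median, which resists the extreme value;
# the mean does not
median(x)
mean(x)
```

The two functions can give slightly different quartiles on other data, because hinges and percentiles are defined differently.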

Visualising the Five Number Summary

library(ggplot2)

ggplot(gapminder, aes(y = lifeExp)) + 
  geom_boxplot() + 
  theme_minimal()

Visualising the IQR for groups

library(ggplot2)

ggplot(gapminder, aes(x = continent, y = lifeExp)) + 
  geom_boxplot() + 
  theme_minimal()

Visualising the distribution of numeric variables

ggplot(gapminder, aes(x = lifeExp)) + 
  geom_histogram() + 
  theme_minimal()

Visualising the distribution of numeric variables

ggplot(gapminder, aes(x = lifeExp)) + 
  geom_density() + 
  theme_minimal()

Visualising the distribution of numeric variables

ggplot(gapminder, aes(x = lifeExp, fill = continent)) + 
  geom_density(alpha = 0.5) + 
  theme_minimal()

Visualising counts

gapminder |>
  distinct(continent, country) |> 
  count(continent) |> 
  ggplot(aes(x = n, y = reorder(continent, n))) + 
  geom_col() + 
  theme_minimal()
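
The `distinct()` step matters because each country appears once per year in gapminder, so counting rows directly would count country-years rather than countries. Below is a minimal base R sketch of the same logic, using a small made-up data frame (`toy` and its values are illustrative assumptions).

```r
# Made-up country-year rows: each country appears once per year (illustrative only)
toy <- data.frame(
  continent = c("Asia", "Asia", "Asia", "Asia", "Europe", "Europe"),
  country   = c("A", "A", "B", "B", "C", "C"),
  year      = c(1952, 1957, 1952, 1957, 1952, 1957)
)

# Counting rows directly counts country-years, not countries
row_counts <- table(toy$continent)

# Keep one row per continent-country pair first, then count
pairs <- unique(toy[, c("continent", "country")])
country_counts <- table(pairs$continent)

row_counts
country_counts
```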

Identifying unusual values

ggplot(gapminder, aes(x = gdpPercap)) + 
  geom_histogram() + 
  theme_minimal()

Identifying unusual values

ggplot(gapminder, aes(x = gdpPercap)) + 
  geom_boxplot() + 
  theme_minimal()
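
The points that `geom_boxplot()` draws beyond the whiskers are values more than 1.5 times the IQR outside the 25th or 75th percentile. A minimal base R sketch of that rule follows. The helper name `flag_outliers` and the vector `x` are hypothetical, chosen only for illustration.

```r
# Hypothetical helper: flag values more than 1.5 * IQR beyond the quartiles
flag_outliers <- function(x) {
  q1 <- quantile(x, 0.25, names = FALSE)
  q3 <- quantile(x, 0.75, names = FALSE)
  iqr <- q3 - q1
  x < q1 - 1.5 * iqr | x > q3 + 1.5 * iqr
}

# Made-up illustrative values (an assumption, not gapminder data)
x <- c(2, 4, 6, 8, 100)

# Logical vector marking which values fall outside the fences
is_outlier <- flag_outliers(x)
x[is_outlier]
```

With gapminder loaded, `gapminder$gdpPercap[flag_outliers(gapminder$gdpPercap)]` would list the flagged values. A flagged value is not necessarily an error, only a point worth a closer look.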

Identifying relationships in your data

Does one variable tend to move in the same direction as another?

ggplot(gapminder, aes(x = log(gdpPercap), y = lifeExp)) + 
  geom_point() + 
  theme_minimal()

A preview of linear regression

ggplot(gapminder, aes(x = log(gdpPercap), y = lifeExp)) + 
  geom_point(alpha = 0.5) + 
  geom_smooth(method = "lm") + 
  theme_minimal()
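
`geom_smooth(method = "lm")` draws a line fitted by ordinary least squares. As a rough sketch, the same kind of fit can be obtained directly with `lm()`, and `cor()` summarises how closely two variables move together. The values of `x` and `y` are made up for illustration.

```r
# Made-up illustrative values (an assumption, not gapminder data)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Correlation: positive values mean the variables tend to rise together
r <- cor(x, y)

# Fit the line y = intercept + slope * x by ordinary least squares
fit <- lm(y ~ x)
intercept <- unname(coef(fit)[1])
slope <- unname(coef(fit)[2])

r
intercept
slope
```

For the plot above, the corresponding model would be `lm(lifeExp ~ log(gdpPercap), data = gapminder)`.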

There has to be an easier way!

A quick look with glimpse():

glimpse(gapminder)

A quick summary with skim():

install.packages("skimr")

skimr::skim(gapminder)

Summary

Today you:

  1. Learnt how to explore and visualise interesting relationships in your data

  2. Used your new data science tools to better understand your data